Deterministic Policy Gradient


Softmax Deep Double Deterministic Policy Gradients

Neural Information Processing Systems

A widely used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from an overestimation problem that can negatively affect performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can introduce a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first analyze the softmax operator theoretically in continuous action spaces. We then uncover an important property of the softmax operator in actor-critic algorithms: it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator on single and double estimators respectively, which effectively reduces both overestimation and underestimation bias. We conduct extensive experiments on challenging continuous control tasks, and the results show that SD3 outperforms state-of-the-art methods.
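A minimal sketch of the Boltzmann softmax value estimate described above, assuming actions are sampled from a Gaussian around the policy's action; the toy critic, `beta`, and the sample counts are illustrative, and the sketch omits the double-estimator machinery that distinguishes SD3 from SD2.

```python
import numpy as np

def softmax_value_estimate(q_fn, mu_action, beta=1.0, num_samples=50, noise_std=0.2):
    """Boltzmann softmax estimate of the state value:
    V ~= sum_i softmax(beta * Q(a_i)) * Q(a_i), with actions a_i
    sampled around the policy's action mu_action."""
    actions = mu_action + noise_std * np.random.randn(num_samples, mu_action.shape[-1])
    q_values = np.array([q_fn(a) for a in actions])
    # Subtract the max before exponentiating for numerical stability.
    weights = np.exp(beta * (q_values - q_values.max()))
    weights /= weights.sum()
    return float(np.dot(weights, q_values))

# Toy critic: a concave quadratic with its optimum at a = 0.3.
q_fn = lambda a: -np.sum((a - 0.3) ** 2)
print(softmax_value_estimate(q_fn, mu_action=np.zeros(2)))
```

As beta grows, the estimate approaches the max over the sampled actions; as beta goes to zero, it approaches their mean, which is what lets the operator trade off between over- and underestimation.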


Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Shixiang (Shane) Gu, Timothy Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Schölkopf, Sergey Levine

Neural Information Processing Systems

Off-policy model-free deep reinforcement learning methods that reuse previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning.
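A minimal sketch of the merging idea, assuming the two gradient estimates are already available as vectors; the mixing coefficient `nu` and the stand-in gradients are illustrative, not the paper's exact interpolated estimator.

```python
import numpy as np

def interpolated_gradient(grad_on_policy, grad_off_policy, nu=0.2):
    """Convex mix of an on-policy (likelihood-ratio) gradient estimate and an
    off-policy (critic-based) estimate. nu=0 recovers the pure on-policy
    update; nu=1 the pure off-policy update."""
    return (1.0 - nu) * grad_on_policy + nu * grad_off_policy

# Toy example with stand-in gradient vectors.
g_on = np.array([0.5, -1.0])   # e.g., a REINFORCE/advantage estimate
g_off = np.array([0.4, -0.8])  # e.g., a critic-based DPG estimate
print(interpolated_gradient(g_on, g_off, nu=0.3))
```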


Quasi-Newton Compatible Actor-Critic for Deterministic Policies

Arash Bahari Kordabad, Dean Brandner, Sebastien Gros, Sergio Lucia, Sadegh Soudjani

arXiv.org Artificial Intelligence

In this paper, we propose a second-order deterministic actor-critic framework in reinforcement learning that extends the classical deterministic policy gradient method to exploit curvature information of the performance function. Building on the concept of compatible function approximation for the critic, we introduce a quadratic critic that simultaneously preserves the true policy gradient and an approximation of the performance Hessian. A least-squares temporal difference learning scheme is then developed to estimate the quadratic critic parameters efficiently. This construction enables a quasi-Newton actor update using information learned by the critic, yielding faster convergence compared to first-order methods. The proposed approach is general and applicable to any differentiable policy class. Numerical examples demonstrate that the method achieves improved convergence and performance over standard deterministic actor-critic baselines.
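A toy sketch of the quasi-Newton actor update that a quadratic critic enables, written here for a cost to be minimized and assuming the critic has already supplied gradient and Hessian estimates of the performance function; the damping constant and the quadratic test cost are illustrative.

```python
import numpy as np

def quasi_newton_step(theta, grad, hess, damping=1e-3):
    """One quasi-Newton update using curvature recovered from a quadratic
    critic: theta' = theta - (H + damping * I)^{-1} grad. Damping keeps the
    linear solve well conditioned."""
    step = np.linalg.solve(hess + damping * np.eye(len(theta)), grad)
    return theta - step

# Toy quadratic cost J(theta) = 0.5 * theta^T A theta - b^T theta.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
theta = np.zeros(2)
grad = A @ theta - b
print(quasi_newton_step(theta, grad, A))  # one step reaches A^{-1} b (up to damping)
```

On a quadratic objective a single such step lands on the optimum, which is the intuition behind the faster convergence claimed over first-order methods.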



Finetuning Deep Reinforcement Learning Policies with Evolutionary Strategies for Control of Underactuated Robots

Marco Calì, Alberto Sinigaglia, Niccolò Turcato, Ruggero Carli, Gian Antonio Susto

arXiv.org Artificial Intelligence

Deep Reinforcement Learning (RL) has emerged as a powerful method for addressing complex control problems, particularly those involving underactuated robotic systems. However, in some cases, policies may require refinement to achieve optimal performance and robustness aligned with specific task objectives. In this paper, we propose an approach for fine-tuning Deep RL policies using Evolutionary Strategies (ES) to enhance control performance for underactuated robots. Our method involves initially training an RL agent with Soft Actor-Critic (SAC) using a surrogate reward function designed to approximate complex, task-specific scoring metrics. We subsequently refine this learned policy through a zero-order optimization step employing the Separable Natural Evolution Strategy (SNES), directly targeting the original score. Experimental evaluations conducted in the context of the 2nd AI Olympics with RealAIGym at IROS 2024 demonstrate that our evolutionary fine-tuning significantly improves agent performance while maintaining high robustness. The resulting controllers outperform established baselines, achieving competitive scores for the competition tasks.
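A minimal SNES sketch under simplifying assumptions (uniform rank-based utilities, fixed learning rates, a flat parameter vector); the hyperparameters and toy score are illustrative, not the authors' setup.

```python
import numpy as np

def snes_finetune(theta0, score_fn, iters=300, pop=16, lr_mu=1.0, lr_sigma=0.1):
    """Minimal separable NES loop: adapt a Gaussian search distribution
    (mean mu, per-dimension sigma) to maximize a black-box score."""
    mu = theta0.astype(float).copy()
    sigma = 0.1 * np.ones_like(mu)
    for _ in range(iters):
        eps = np.random.randn(pop, mu.size)                   # standard-normal samples
        scores = np.array([score_fn(mu + sigma * e) for e in eps])
        utils = scores.argsort().argsort() / (pop - 1) - 0.5  # zero-mean rank utilities
        mu += lr_mu * sigma * (utils @ eps) / pop             # step on the mean
        sigma *= np.exp(0.5 * lr_sigma * (utils @ (eps**2 - 1)) / pop)  # and on log-sigma
    return mu

# Toy score: negative squared distance to a target parameter vector.
target = np.array([0.5, -0.3, 1.0])
print(snes_finetune(np.zeros(3), lambda th: -np.sum((th - target) ** 2)))
```

Because SNES only queries the score, it can target the true competition metric directly, which is exactly why it suits the fine-tuning step after surrogate-reward SAC training.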


Equivalence of stochastic and deterministic policy gradients

Emo Todorov

arXiv.org Artificial Intelligence

Policy gradient methods [1, 3, 4, 6, 7] are now widely used and have produced impressive empirical results. In continuous control, they have been derived for both stochastic [3] and deterministic [7] policies. The resulting algorithms have different strengths and weaknesses, even though there are a number of similarities in how they are applied in practice, and recent generalizations [10, 13] have pointed to even deeper similarities. Here we study the relationship between the two. First, we focus on MDPs with Gaussian control noise and quadratic control cost (while the dynamics and state cost remain general) and show that deterministic and stochastic policy gradients for such MDPs are equivalent. We then develop a much more general result, in which any MDP with a stochastic policy can be converted into an equivalent MDP with a deterministic policy. The new MDP has the same state and policy parameters but a different control space, namely the sufficient statistics of the stochastic policy in the original MDP. The only quantities that are not equivalent (in either the special or the general case) are the state-control value functions. The latter observation suggests that policy gradient methods can be unified by approximating state value functions, instead of the common practice of approximating state-control value functions.
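A sketch of the conversion in informal notation (the symbols below are illustrative, not necessarily the paper's): the sufficient statistics of a Gaussian policy become the deterministic controls of a new MDP whose dynamics and cost average over the original control noise.

```latex
% Original MDP: stochastic policy
%   \pi_\theta(u \mid x) = \mathcal{N}\big(u;\, \mu_\theta(x), \Sigma_\theta(x)\big).
% Equivalent MDP: the deterministic control is the sufficient statistic
% s = (\mu, \Sigma), with dynamics and cost averaged over the control noise:
\[
  \tilde p(x' \mid x, s) = \int \mathcal{N}(u;\, \mu, \Sigma)\, p(x' \mid x, u)\, du,
  \qquad
  \tilde \ell(x, s) = \int \mathcal{N}(u;\, \mu, \Sigma)\, \ell(x, u)\, du .
\]
% With the deterministic policy s_\theta(x) = (\mu_\theta(x), \Sigma_\theta(x)),
% the two performance gradients coincide:
\[
  \nabla_\theta J_{\mathrm{stochastic}}(\theta)
  = \nabla_\theta J_{\mathrm{deterministic}}(\theta).
\]
```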


Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control

Hazim Alzorgan, Abolfazl Razi

arXiv.org Artificial Intelligence

Actor-critic methods such as Twin Delayed Deep Deterministic Policy Gradient (TD3) depend on simple noise-based exploration, which can result in suboptimal policy convergence. In this study, we introduce Monte Carlo Beam Search (MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts with TD3 to improve exploration and action selection. MCBS produces several candidate actions around the policy's output and assesses them through short-horizon rollouts, enabling the agent to make better-informed choices. We test MCBS across various continuous-control benchmarks, including HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing enhanced sample efficiency and performance compared to standard TD3 and other baselines such as SAC, PPO, and A2C. Our findings emphasize MCBS's capability to enhance policy learning through structured look-ahead search while ensuring computational efficiency. Additionally, we offer a detailed analysis of crucial hyperparameters, such as beam width and rollout depth, and explore adaptive strategies to optimize MCBS for complex control tasks. Our method shows a higher convergence rate across different environments compared to TD3, SAC, PPO, and A2C; for instance, it reaches 90% of the maximum achievable reward within around 200 thousand timesteps, compared to 400 thousand for the second-best method.
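A minimal sketch of the candidate-generation and rollout-scoring loop, assuming a clonable simulator with a simplified `step` interface returning (observation, reward, done); `ToyEnv`, the hyperparameters, and the interface are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

def mcbs_action(env_clone_fn, policy, state, beam_width=8, rollout_depth=5,
                noise_std=0.1, gamma=0.99):
    """Score beam_width candidate actions (the policy's action plus Gaussian
    perturbations of it) with short-horizon Monte Carlo rollouts on a cloned
    simulator, and return the best-scoring candidate."""
    base = policy(state)
    candidates = [base] + [base + noise_std * np.random.randn(*base.shape)
                           for _ in range(beam_width - 1)]

    def rollout_return(first_action):
        env, action = env_clone_fn(), first_action   # fresh copy at `state`
        total, discount = 0.0, 1.0
        for _ in range(rollout_depth):
            obs, reward, done = env.step(action)
            total += discount * reward
            discount *= gamma
            if done:
                break
            action = policy(obs)                     # follow the policy afterwards
        return total

    return max(candidates, key=rollout_return)

# Toy demo: 1-D point mass, reward = -|position|; stronger corrections win.
class ToyEnv:
    def __init__(self, pos): self.pos = pos
    def step(self, a):
        self.pos += float(a[0])
        return np.array([self.pos]), -abs(self.pos), False

policy = lambda obs: np.array([-0.1 * obs[0]])       # weak proportional controller
print(mcbs_action(lambda: ToyEnv(1.0), policy, np.array([1.0]), beam_width=16))
```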


Robust Deterministic Policy Gradient for Disturbance Attenuation and Its Application to Quadrotor Control

Taeho Lee, Donghwan Lee

arXiv.org Artificial Intelligence

Practical control systems pose significant challenges in identifying optimal control policies due to uncertainties in the system model and external disturbances. While $H_\infty$ control techniques are commonly used to design robust controllers that mitigate the effects of disturbances, these methods often require complex and computationally intensive calculations. To address this issue, this paper proposes a reinforcement learning algorithm called Robust Deterministic Policy Gradient (RDPG), which formulates the $H_\infty$ control problem as a two-player zero-sum dynamic game. In this formulation, one player (the user) aims to minimize the cost, while the other player (the adversary) seeks to maximize it. We then employ deterministic policy gradient (DPG) and its deep reinforcement learning counterpart to train a robust control policy with effective disturbance attenuation. In particular, for practical implementation, we introduce an algorithm called robust deep deterministic policy gradient (RDDPG), which employs a deep neural network architecture and integrates techniques from the twin-delayed deep deterministic policy gradient (TD3) to enhance stability and learning efficiency. To evaluate the proposed algorithm, we implement it on an unmanned aerial vehicle (UAV) tasked with following a predefined path in a disturbance-prone environment. The experimental results demonstrate that the proposed method outperforms other control approaches in terms of robustness against disturbances, enabling precise real-time tracking of moving targets even under severe disturbance conditions.
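A toy sketch of the alternating zero-sum updates behind the game formulation above; the scalar quadratic game, learning rate, and gradient placeholders are illustrative, not the RDPG/RDDPG implementation.

```python
def rdpg_updates(grad_u, grad_w, theta_u, theta_w, lr=1e-2):
    """One alternating pair of zero-sum updates: the user's policy parameters
    theta_u descend the cost while the adversary's disturbance parameters
    theta_w ascend it, each following its own (deterministic) policy gradient."""
    theta_u = theta_u - lr * grad_u(theta_u, theta_w)   # user: minimize cost
    theta_w = theta_w + lr * grad_w(theta_u, theta_w)   # adversary: maximize cost
    return theta_u, theta_w

# Toy scalar game: cost(u, w) = u**2 - 0.5 * w**2 + u * w, saddle point at (0, 0).
u, w = 1.0, -1.0
for _ in range(2000):
    u, w = rdpg_updates(lambda a, b: 2 * a + b,    # d cost / d u
                        lambda a, b: a - b,        # d cost / d w
                        u, w)
print(round(u, 4), round(w, 4))                    # -> approximately (0.0, 0.0)
```

Training the user's policy against a worst-case adversary in this way is what yields the disturbance attenuation the paper targets.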


Probabilistic Artificial Intelligence

Andreas Krause, Jonas Hübotter

arXiv.org Artificial Intelligence

Artificial intelligence commonly refers to the science and engineering of artificial systems that can carry out tasks generally associated with requiring aspects of human intelligence, such as playing games, translating languages, and driving cars. In recent years, there have been exciting advances in learning-based, data-driven approaches towards AI, and machine learning and deep learning have enabled computer systems to perceive the world in unprecedented ways. Reinforcement learning has enabled breakthroughs in complex games such as Go and challenging robotics tasks such as quadrupedal locomotion. A key aspect of intelligence is to not only make predictions, but to reason about the uncertainty in these predictions, and to consider this uncertainty when making decisions. This is what this manuscript on "Probabilistic Artificial Intelligence" is about. The first part covers probabilistic approaches to machine learning. We discuss the distinction between "epistemic" uncertainty due to lack of data and "aleatoric" uncertainty, which is irreducible and stems, e.g., from noisy observations and outcomes. We discuss concrete approaches to probabilistic inference and modern approaches to efficient approximate inference. The second part of the manuscript is about taking uncertainty into account in sequential decision tasks. We consider active learning and Bayesian optimization -- approaches that collect data by proposing experiments that are informative for reducing the epistemic uncertainty. We then consider reinforcement learning and modern deep RL approaches that use neural network function approximation. We close by discussing modern approaches in model-based RL, which harness epistemic and aleatoric uncertainty to guide exploration, while also reasoning about safety.
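One common ensemble-based heuristic for separating the two kinds of uncertainty the abstract distinguishes; this illustrates the general idea, not the manuscript's specific method, and the numbers are made up.

```python
import numpy as np

def split_uncertainty(member_means, member_vars):
    """Ensemble heuristic: aleatoric uncertainty ~ average of the members'
    predicted noise variances (irreducible); epistemic uncertainty ~ variance
    of the members' mean predictions (disagreement that shrinks with data)."""
    return np.mean(member_vars), np.var(member_means)

# Five hypothetical ensemble members predicting at a single input.
aleatoric, epistemic = split_uncertainty(
    member_means=np.array([1.9, 2.1, 2.0, 2.2, 1.8]),
    member_vars=np.array([0.30, 0.25, 0.35, 0.28, 0.32]))
print(aleatoric, epistemic)
```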


Learning Agents With Prioritization and Parameter Noise in Continuous State and Action Space

Rajesh Mangannavar, Gopalakrishnan Srinivasaraghavan

arXiv.org Artificial Intelligence

Among the many variants of RL, an important class of problems is those in which the state and action spaces are continuous: autonomous robots, autonomous vehicles, and optimal control are all examples of problems with continuous state and action spaces that lend themselves naturally to reinforcement-based algorithms. In this paper, we introduce a prioritized form of a combination of state-of-the-art approaches, Deep Q-learning (DQN) and Deep Deterministic Policy Gradient (DDPG), to improve on earlier results for continuous state and action space problems. Our experiments also involve the use of parameter noise during training, resulting in more robust deep RL models that significantly outperform the earlier results. We believe these results are a valuable addition for continuous state and action space problems.
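Minimal sketches of the two ingredients named above, proportional prioritized sampling and parameter-space noise; the function names, constants, and the toy layer shapes are illustrative, not the authors' code.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Proportional prioritized replay sampling: transition i is drawn with
    probability p_i**alpha / sum_j p_j**alpha, where p_i = |TD error_i| + eps."""
    p = (np.abs(td_errors) + eps) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    return idx, probs[idx]

def perturb_params(params, stddev=0.02):
    """Parameter-space noise: perturb every weight of the policy network
    (e.g., once per episode) instead of adding noise to the actions."""
    return {name: w + stddev * np.random.randn(*w.shape)
            for name, w in params.items()}

# Toy usage with made-up TD errors and a tiny two-layer policy.
idx, probs = sample_prioritized(np.array([0.1, 2.0, 0.5, 0.05]), batch_size=2)
noisy = perturb_params({"W1": np.zeros((4, 8)), "W2": np.zeros((8, 2))})
print(idx, probs)
```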